# Zero-shot Classification
Cropvision CLIP
A vision-language model fine-tuned based on the CLIP architecture, specifically designed for zero-shot classification of plant diseases
Image Classification English
C
EduFalcao
38
0
Bge Reranker V2 M3 Q5 K M GGUF
Apache-2.0
This model is converted from BAAI/bge-reranker-v2-m3 into GGUF format using llama.cpp via ggml.ai's GGUF-my-repo space, primarily for text classification tasks.
Text Embedding Other
B
pyarn
31
1
Marqo Fashionsiglip ST
Apache-2.0
Marqo-FashionSigLIP is a multimodal embedding model optimized for fashion product search, achieving a 57% improvement in MRR and recall rate compared to FashionCLIP.
Image-to-Text
Transformers English

M
pySilver
3,586
0
Drama Large Xnli Anli
A zero-shot classification model fine-tuned on XNLI and ANLI datasets based on facebook/drama-large, supporting natural language inference tasks in 15 languages.
Large Language Model Supports Multiple Languages
D
mjwong
23
0
Clip Backdoor Rn50 Cc3m Badnets
MIT
This is a pre-trained backdoor-injected model for studying backdoor sample detection in contrastive language-image pretraining.
Text-to-Image English
C
hanxunh
16
0
Gte Multilingual Base Xnli Anli
Apache-2.0
This model is a fine-tuned version of Alibaba-NLP/gte-multilingual-base on the XNLI and ANLI datasets, supporting multilingual natural language inference tasks.
Text Classification Supports Multiple Languages
G
mjwong
21
0
Gte Multilingual Base Xnli
Apache-2.0
This model is a fine-tuned version of Alibaba-NLP/gte-multilingual-base on the XNLI dataset, supporting multilingual natural language inference tasks.
Text Classification Supports Multiple Languages
G
mjwong
58
0
Clip Vit Base Patch32 Lego Brick
MIT
A CLIP-based fine-tuned model for LEGO brick image-text matching, specifically designed to recognize LEGO bricks and their descriptions.
Text-to-Image
Transformers English

C
armaggheddon97
44
0
Conceptclip
MIT
ConceptCLIP is a large-scale vision-language pre-training model enhanced with medical concepts, suitable for various medical imaging modalities, capable of achieving robust performance across multiple medical imaging tasks.
Image-to-Text
Transformers English

C
JerrryNie
836
1
Vit Large Patch14 Clip 224.laion2b
Apache-2.0
Vision Transformer model based on CLIP architecture, specialized in image feature extraction
Image Classification
Transformers

V
timm
502
0
Microsoft Git Base
MIT
GIT is a Transformer-based generative image-to-text model capable of converting visual content into textual descriptions.
Image-to-Text Supports Multiple Languages
M
seckmaster
18
0
Aimv2 Large Patch14 224 Lit
AIMv2 is a series of vision models pretrained with multimodal autoregressive objectives, demonstrating outstanding performance across multiple multimodal understanding benchmarks.
Image-to-Text
A
apple
222
6
LLM2CLIP Llama 3 8B Instruct CC Finetuned
Apache-2.0
LLM2CLIP is an innovative approach that enhances CLIP's cross-modal capabilities through large language models, significantly improving the discriminative power of visual and text representations.
Multimodal Fusion
L
microsoft
18.16k
35
RS M CLIP
MIT
A multilingual vision-language pre-trained model for the remote sensing field, supporting image-text cross-modal tasks in 10 languages.
Image-to-Text Supports Multiple Languages
R
joaodaniel
248
1
Marqo Fashionsiglip
Apache-2.0
A fine-tuned fashion multimodal retrieval model based on ViT-B-16-SigLIP, specializing in fashion product search
Text-to-Image English
M
Styld
39
3
Marqo Fashionclip
Apache-2.0
Marqo-FashionCLIP is a fashion-domain multimodal retrieval model based on the CLIP architecture, achieving state-of-the-art performance in fashion product search tasks through generalized contrastive learning.
Text-to-Image
Transformers English

M
Marqo
8,376
23
Video Llava
A large-scale vision-language model based on Vision Transformer architecture, supporting cross-modal understanding between images and text
Text-to-Image
V
AnasMohamed
194
0
Vit L 16 HTxt Recap CLIP
A CLIP model trained on the Recap-DataComp-1B dataset using LLaMA-3 generated captions, suitable for zero-shot image classification tasks
Text-to-Image
V
UCSC-VLAA
538
17
Clip ViT B 32 Vision
MIT
ONNX ported version based on CLIP ViT-B/32 architecture, suitable for image classification and similarity search tasks.
Image Classification
Transformers

C
Qdrant
10.01k
7
Bert Base Japanese V3 Nli Jsnli Jnli Jsick
A Japanese natural language inference cross-encoder trained on tohoku-nlp/bert-base-japanese-v3, supporting entailment, neutral, and contradiction judgments
Text Classification Supports Multiple Languages
B
akiFQC
51
1
Clip Japanese Base
Apache-2.0
A Japanese CLIP model developed by LY Corporation, trained on approximately 1 billion web-collected image-text pairs, suitable for various vision tasks.
Text-to-Image
Transformers Japanese

C
line-corporation
14.31k
22
Berturk Legal
MIT
BERTurk-Legal is a Transformer-based language model specifically designed for prior case retrieval tasks in the Turkish legal domain.
Large Language Model
Transformers Other

B
KocLab-Bilkent
382
6
Bert Base Japanese V3 Nli Jsnli
A Japanese natural language inference model based on BERT architecture, trained on the JSNLI dataset, used to determine logical relationships (entailment/neutral/contradiction) between sentence pairs
Text Classification Supports Multiple Languages
B
akiFQC
203
0
Roberta Base Zeroshot V2.0 C
MIT
A zero-shot classification model based on the RoBERTa architecture, designed for text classification tasks without requiring training data, supports both GPU and CPU operation, and is trained using fully business-friendly data.
Text Classification
Transformers English

R
MoritzLaurer
3,188
4
Deberta V3 Large Zeroshot V2.0 C
MIT
A DeBERTa-v3-large model specifically designed for efficient zero-shot classification, trained on fully commercially friendly synthetic data and NLI datasets, supporting GPU/CPU inference
Text Classification
Transformers English

D
MoritzLaurer
1,560
20
Kf Deberta Base Cross Nli
MIT
A Korean natural language inference model based on the DeBERTa architecture, trained on the kor-nli and klue-nli datasets, supporting zero-shot classification tasks.
Text Classification
Transformers Korean

K
deliciouscat
21
2
Tecoa4 Clip
MIT
TeCoA is a vision-language model initialized from OpenAI CLIP, enhanced with supervised adversarial fine-tuning for improved robustness
Text-to-Image
T
chs20
51
1
CONCH
CONCH is a vision-language foundation model for histopathology, pre-trained on 1.17 million pathology image-text pairs, demonstrating state-of-the-art performance in 14 computational pathology tasks.
Image-to-Text English
C
MahmoodLab
12.76k
107
Japanese Clip Vit B 32 Roberta Base
A Japanese version of the CLIP model that maps Japanese text and images into the same embedding space, suitable for zero-shot image classification, text-image retrieval, and other tasks.
Text-to-Image
Transformers Japanese

J
recruit-jp
384
9
Tinyclip ViT 39M 16 Text 19M YFCC15M
MIT
TinyCLIP is an innovative cross-modal distillation approach for large-scale language-image pre-trained models, achieving the optimal balance between speed and accuracy through affinity mimicking and weight inheritance techniques.
Text-to-Image
Transformers

T
wkcn
654
0
Fmops Distilbert Prompt Injection Onnx
Apache-2.0
This is the ONNX format conversion of the fmops/distilbert-prompt-injection model, designed for detecting prompt injection attacks.
Large Language Model
Transformers English

F
protectai
23
0
Roberta Base Nli
This model is a natural language inference model based on the RoBERTa architecture, specifically fine-tuned for depression detection tasks.
Text Classification
Transformers English

R
kwang123
18
1
Git Large Coco
MIT
GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.
Image-to-Text
Transformers Supports Multiple Languages

G
alexgk
25
0
Clip Vit Large Patch14
OpenAI's open-source CLIP model, based on Vision Transformer (ViT) architecture, supporting joint understanding of images and text.
Text-to-Image
Transformers

C
Xenova
17.41k
0
Echo Clip
MIT
A zero-shot image classification model based on OpenCLIP
Image Classification
E
mkaichristensen
647
9
Xlm Roberta Large Manifesto
MIT
A fine-tuned xlm-roberta-large model based on multilingual training data for zero-shot text classification, using the Manifesto Project coding scheme.
Text Classification
Transformers Other

X
poltextlab
124
0
FLIP Base 32
Apache-2.0
This is a vision-language model based on the CLIP architecture, specifically post-trained on 80 million face images.
Multimodal Fusion
Transformers

F
FLIP-dataset
16
0
CLIP Giga Config Fixed
MIT
A large CLIP model trained on the LAION-2B dataset, using ViT-bigG-14 architecture, supporting cross-modal understanding between images and text
Text-to-Image
Transformers

C
Geonmo
109
1
Pubmed Clip Vit Base Patch32
MIT
PubMedCLIP is a version of the CLIP model fine-tuned for the medical field, specifically designed to handle medical images and related text.
Text-to-Image English
P
flaviagiammarino
10.27k
19
Git Base Finetune
MIT
GIT is a Transformer-based generative image-to-text model capable of converting visual content into descriptive text.
Image-to-Text
Transformers Supports Multiple Languages

G
wangjin2000
18
0
- 1
- 2
Featured Recommended AI Models